NEURAL NETWORK CONSTRUCTION USING EVOLUTIONARY SEARCH


J.R. McDonnell and W.C. Page

D.E. Waagen

TRW Systems Integration Group

ABSTRACT

This  work  investigates  the  application  of  evolutionary  search  for  training  candidate 
hidden  units  in  cascade-correlation  learning  architectures.  A  hybrid  evolutionary  search 
algorithm  which  implements  techniques  from  evolutionary  programming  and  evolution 
strategies  is  proposed.  This  approach  is  evaluated  on  selected  low-dimensional 
examples  which  are  nonlinearly  separable. 

1.  Introduction 

By  virtue  of  their  function  approximation  capabilities,  feedforward  artificial  neural 
networks  have  lent  themselves  well  to  applications  in  statistical  pattern  classification  and 
nonlinear system mapping. Given the desired mapping (I/O) pairs $x \in \mathbb{R}^m$ and
$y \in \mathbb{R}^n$, a feedforward neural network can be characterized by the triple
$\mathcal{N}(W,C,F)$, which implements the mapping $f(x \mid W,C,F): \mathbb{R}^m \to \mathbb{R}^n$,
where $W$ parameterizes the network weights,
C  (strongly)  specifies  the  network  connectivity,  and  F  represents  the  activation  functions 
corresponding  to  each  node.  The  conventional  approach  in  most  applications  is  to 
arbitrarily  fix  the  architecture  by  defining  the  number  of  hidden  units  and  the  number  of 
layers and defaulting to full connectivity between layers. A sigmoidal activation function
such as $g(u) = 1/(1 + e^{-u})$ normally serves as the de facto nonlinear mapping of each
node in the network.

Given  the  above  constraints,  the  standard  network  design  approach  becomes  a 
nonlinear  regression  problem  which  seeks  to  determine  a  weight  set  w  that  minimizes  an 
arbitrarily  selected  objective  function  J(x,w).  The  most  commonly  used  objective  function 
is  the  sum-squared  network  error  which  satisfies  the  continuity  requirements  of  gradient 
search methods such as backpropagation [1]. The traditional design approach can become a
tedious and laborious undertaking if an acceptable objective criterion (i.e., $J < \epsilon$) is not
met due to inadequate network representation or learning.

The cascade-correlation learning architecture (CCLA) [2] has been proposed as a
constructive method that automates the network design process. This investigation
modifies the traditional cascade-correlation training algorithm by using evolutionary
search instead of quickprop [3] to train each candidate hidden node. At the end of each
evolutionary cycle, the best "evolved" candidate node is incorporated into the network.

The  following  sections  discuss  the  CCLA  and  our  hybrid  approach  for 
evolutionary  optimization.  The  application  of  evolutionary  search  to  the  CCLA  is  then 
described.  Finally,  results  are  discussed  for  the  parity  and  spiral  problems. 


1.1 Cascade-Correlation Architectures


Cascade-correlation architectures were introduced by Fahlman and Lebiere [2] as a
means of automatically determining parsimonious network structure given a particular
data set. The network is initialized with only the input units mapped directly to the
output  units.  New  hidden  units  are  individually  incorporated  into  the  network  with  input 
weights  frozen.  The  input  weights  to  each  hidden  unit  are  frozen  since  newly  created 
nodes  have  been  trained  to  maximize  the  correlation  between  their  output  and  the  residual 
output  error  of  the  network.  This  is  the  correlation  aspect  of  the  cascade-correlation 
architecture.  A  pool  of  hidden  units  can  be  used  to  increase  the  likelihood  of  finding  a 
good  candidate  unit.  Each  new  hidden  unit  is  connected  to  all  input  and  previously 
created  hidden  units  in  the  network  as  shown  in  Figure  1.  All  of  the  hidden  units  are 
connected  to  all  of  the  output  units.  After  each  hidden  unit  is  incorporated  into  the 
network,  additional  training  takes  place  on  the  weights  to  the  output  layer  with  the 
weights to the hidden units remaining fixed. Squires and Shavlik [4] have shown that faster
training  times  and  better  generalization  can  sometimes  result  if  all  of  the  weights  in  the 
network  are  trained  using  backpropagation. 
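
As a rough illustration of the wiring described above, the following Python sketch computes a forward pass through a cascaded network. The function names, the list-based weight layout, the logistic activation, and the linear output units are assumptions made for this example; they are not details of the original implementation.

```python
import math

def logistic(u):
    """Logistic sigmoid g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def cascade_forward(inputs, hidden_weights, output_weights, g=logistic):
    """Forward pass through a cascaded network (illustrative layout).

    hidden_weights[k] holds the frozen input weights of the k-th hidden unit:
    one weight per network input plus one per previously created hidden unit.
    output_weights[o] holds one weight per input and per hidden unit.
    Linear output units are assumed.
    """
    activations = list(inputs)                 # x_1 .. x_m
    for w in hidden_weights:                   # units in order of creation
        assert len(w) == len(activations)
        u = sum(wj * xj for wj, xj in zip(w, activations))
        activations.append(g(u))               # later units see this output too
    return [sum(vj * xj for vj, xj in zip(v, activations))
            for v in output_weights]

# Two inputs, two cascaded hidden units, one linear output.
hidden = [[1.0, -1.0], [0.5, 0.5, 2.0]]
outputs = cascade_forward([1.0, 1.0], hidden, [[0.0, 0.0, 1.0, 1.0]])
```

Because each hidden unit appends its output to the activation list, the second unit's weight vector is one element longer than the first, which mirrors the growing fan-in of the cascade.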

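The hidden and output units achieve the following mappings

hidden unit:  $x_i = g\left(\sum_{j=1}^{i-1} w_{ij} x_j\right)$,  $m < i \le m+h$  (1)

output unit:  $x_i = g\left(\sum_{j=1}^{m+h} w_{ij} x_j\right)$,  $m+h < i \le m+h+n$  (2)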


where  h  represents  the  number  of  hidden  units,  m  the  number  of  inputs,  and  n  the  number 
of  outputs.  The  cascade-correlation  architecture  appears  quite  similar  to  the  projection 
pursuit technique [5] when, assuming linear outputs, equation (2) is broken down into its
linear  and  nonlinear  components 


$$x_i = a_i^T u + \sum_{j=m+1}^{m+h} a_{ij}\, g_j(a_j^T u)$$

where $u_i = x_i$ for all $i \in \{1, \ldots, m\}$. However, this view oversimplifies the
repeated nonlinear transformations which actually occur, since each hidden unit also
receives the outputs of all previously created hidden units.

Figure 1. The cascade-correlation network structure. The boxes indicate weights which are frozen when
the unit is incorporated into the network. The crosses indicate weights which are modified after a new
hidden unit is incorporated.


1.2  Evolutionary  Search 

In 1958, Brooks [6] described a creeping random method where k points were
generated via Gaussian perturbations about a search point. The best point was kept and
the process repeated. Brooks observed that "there are some rather intriguing analogies
that can be made between the creeping random method and evolution." This analogy was
also apparent to Fogel et al. [7], who proposed a random search strategy termed evolutionary
programming (EP). Fogel et al. [7] proposed the following:

"parent"  organism  is  scored  in  terms  of  its  ability  to  accomplish  the  desired  decision 
making  on  the  basis  of  evidence  at  hand.  The  organism  is  mutated  to  yield  an 
"offspring"  which  is  given  the  same  task  and  scored  in  a  similar  manner.  That  organism 
which  demonstrated  the  greatest  ability  to  perform  the  required  function  is  retained  to 
serve  as  parent  for  a  new  offspring. 
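
The creeping random method described above can be sketched in a few lines of Python. This is a minimal illustration rather than Brooks's original procedure: the function name, the fixed step size `sigma`, the iteration budget, and the rule that the search point only moves when the best trial improves on it are all assumptions.

```python
import random

def creeping_random_search(objective, start, k=10, sigma=0.1,
                           iterations=200, seed=0):
    """Minimize `objective` by keeping the best of k Gaussian perturbations."""
    rng = random.Random(seed)
    best, best_score = list(start), objective(start)
    for _ in range(iterations):
        # Generate k trial points around the current search point.
        trials = [[x + rng.gauss(0.0, sigma) for x in best] for _ in range(k)]
        candidate = min(trials, key=objective)
        score = objective(candidate)
        # Assumed elitist rule: never move to a worse point.
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

def sphere(x):
    """Simple convex test objective with its minimum at the origin."""
    return sum(v * v for v in x)

point, score = creeping_random_search(sphere, [1.0, -1.0])
```

The single-parent loop in the quotation corresponds to `k=1`; larger values of `k` trade more objective evaluations per step for a better chance of finding an improving direction.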

More recently, D. Fogel [8, 9] has refined the EP paradigm as well as applied it to a
variety of problems including system identification and control. This work modifies EP by
embedding two additional search strategies in the evolutionary optimization process. That
is, the evolutionary search algorithm implemented in this study augments standard EP by
incorporating recombination of the parents to generate offspring and by modifying the
parents before any offspring are generated. The utilization of the recombination operator
produces an evolutionary search method similar to an evolution strategy (ES) [10]. The
parents are modified using the Solis and Wets [11] random optimization method. The
resulting strategy has been termed a hybrid evolutionary algorithm and is given below in
the nomenclature proposed by Bäck and Schwefel [10]. This hybrid approach appears to
incorporate both Darwinian evolutionary learning and Lamarckian inheritance models,
which gives rise to the term "Lamarckian Learning." As pointed out by Davidor [12],
"Lamarckism is an important model which can complement many learning algorithms."


A HYBRID EVOLUTIONARY SEARCH ALGORITHM


k = 0;
initialize: $P(0) = \{a_1(0), \ldots, a_\mu(0)\}$;
    where $a_i = (x_j, \forall j \in \{1, \ldots, n\})$
evaluate $P(0)$: $\Phi(P(0)) = \{\Phi(a_1(0)), \ldots, \Phi(a_\mu(0))\}$;
do {
    modify parents: $a_i(k) = \{m,s\}(a_i(k)) \;\forall i \in \{1, \ldots, \mu\}$
    mutate: $a'_i(k) = m(a_i(k)) \;\forall i \in \{1, \ldots, \mu\}$
    recombine: $a''_i(k) = r'(P(k) \cup P'(k)) \;\forall i \in \{1, \ldots, \mu\}$
    select: $P(k+1) = s_{(\mu+\mu)}(P(k) \cup P''(k))$;
    k = k + 1;
} while ($\iota(P(k)) \ne$ true)
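
The loop above can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' code: the function and parameter names are hypothetical, the parent-modification step is a simplified stand-in for the Solis and Wets method (one random step and its reflection), the fitness is assumed to be positive and maximized, and selection is taken over the modified parents, mutated offspring, and recombined offspring.

```python
import random

def hybrid_search(fitness, dim, mu=20, generations=50, scale=1.0, seed=0):
    """Maximize a positive `fitness` with a modify/mutate/recombine/select loop."""
    rng = random.Random(seed)

    def modify(a):
        # Simplified Solis-Wets-style step (assumption): try a random step,
        # then its reflection; keep the parent if neither improves.
        step = [rng.gauss(0.0, 0.1) for _ in a]
        base = fitness(a)
        for sign in (1.0, -1.0):
            trial = [x + sign * s for x, s in zip(a, step)]
            if fitness(trial) > base:
                return trial
        return a

    def mutate(a):
        # EP-style mutation: the step size shrinks as fitness grows.
        sigma = (scale / max(fitness(a), 1e-12)) ** 0.5
        return [x + rng.gauss(0.0, sigma) for x in a]

    def recombine(pool):
        # Random point on the segment between two distinct pool members.
        wi, wj = rng.sample(pool, 2)
        lam = rng.random()
        return [xi + lam * (xj - xi) for xi, xj in zip(wi, wj)]

    parents = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(mu)]
    for _ in range(generations):
        parents = [modify(a) for a in parents]
        mutated = [mutate(a) for a in parents]
        recombined = [recombine(parents + mutated) for _ in range(mu)]
        # Deterministic selection of the best mu individuals.
        pool = parents + mutated + recombined
        parents = sorted(pool, key=fitness, reverse=True)[:mu]
    return parents[0]

def target(a):
    """Positive test fitness with its maximum of 1.0 at (1, ..., 1)."""
    return 1.0 / (1.0 + sum((x - 1.0) ** 2 for x in a))

best = hybrid_search(target, dim=2)
```

Keeping the modified parents in the selection pool makes the loop elitist, so the best fitness in the population never decreases from one generation to the next.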


2.  Evolving  Cascaded  Networks 

The  work  conducted  in  this  study  employs  evolutionary  search  to  optimize  a 
population  of  individual  nodes  and  then  the  best  node  is  brought  into  the  network  in  a 
purely constructive manner. This contrasts with other work in evolving networks which
focuses on the optimization of a whole population of neural nets using both constructive
and pruning mechanisms [13, 14]. Benefits of the single node approach include lower
computational  requirements  in  the  form  of  memory  and  processor  power. 

A population of candidate hidden units is randomly initialized with full
connectivity as in the standard cascade-correlation learning architecture (CCLA). The
same optimization objective used to train CCLA units was employed to represent the
fitness of each candidate node


$$\Phi(a_i) = \sum_{o} \left| \sum_{p} (Z_p - \bar{Z})(E_{p,o} - \bar{E}_o) \right|$$

where $Z_p$ represents the output of each candidate node for pattern $p$, $E_{p,o}$ is the residual
error as measured at the network's output $o$, and overbars denote averages over all patterns.
The weight vectors for each candidate unit were modified using the hybrid method given above.
Mutation of each weight was accomplished using standard EP

$$w'_{ij} = w_{ij} + \sqrt{S_f / \Phi(a_i)} \cdot N(0,1)$$

where $S_f$ represents the scaling factor. Recombination took place according to

$$w''_i = w_i + A\,(w_j - w_i)$$

where $i, j \in \{1, \ldots, 2N\}$, $i \ne j$, and $A \sim U(0,1)$. Selection was deterministic using a
$(\mu+\mu+\mu)$ strategy to choose the best $\mu$ individuals among the modified parents, the mutated
offspring, and the recombined offspring. After an arbitrary number of generations, the
best candidate unit was inserted into the network.
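
As a small worked example of the candidate fitness used here, the sketch below evaluates the correlation score for one candidate unit in plain Python. The function name and the nested-list data layout are assumptions made for illustration.

```python
def correlation_fitness(z, residuals):
    """Sum over outputs of |sum over patterns of centered products|.

    z[p] is the candidate unit's output for pattern p;
    residuals[p][o] is the network's residual error at output o for pattern p.
    """
    n_patterns = len(z)
    n_outputs = len(residuals[0])
    z_mean = sum(z) / n_patterns
    score = 0.0
    for o in range(n_outputs):
        e_mean = sum(row[o] for row in residuals) / n_patterns
        s = sum((z[p] - z_mean) * (residuals[p][o] - e_mean)
                for p in range(n_patterns))
        score += abs(s)
    return score

# A candidate whose output tracks the residual error scores highly;
# a constant candidate scores zero.
errors = [[-1.0], [1.0], [-1.0], [1.0]]
tracking = correlation_fitness([0.0, 1.0, 0.0, 1.0], errors)
constant = correlation_fitness([0.5, 0.5, 0.5, 0.5], errors)
```

Because of the absolute value, a candidate that is perfectly anti-correlated with the residual error scores just as well as a correlated one; the sign is absorbed later by the output weight.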

The output weights were found using a more direct approach. The inputs to the
output layer of the cascade-correlation architecture for $p$ patterns are described by

$$X_p = \begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & \cdots & x_{m+h}^{(1)} \\
x_1^{(2)} & x_2^{(2)} & \cdots & x_{m+h}^{(2)} \\
\vdots & \vdots & & \vdots \\
x_1^{(p)} & x_2^{(p)} & \cdots & x_{m+h}^{(p)}
\end{bmatrix}$$

Likewise, the outputs of the network are given by

$$Y_p = \begin{bmatrix}
x_{m+h+1}^{(1)} & x_{m+h+2}^{(1)} & \cdots & x_{m+h+n}^{(1)} \\
x_{m+h+1}^{(2)} & x_{m+h+2}^{(2)} & \cdots & x_{m+h+n}^{(2)} \\
\vdots & \vdots & & \vdots \\
x_{m+h+1}^{(p)} & x_{m+h+2}^{(p)} & \cdots & x_{m+h+n}^{(p)}
\end{bmatrix}$$

where $Y_p = X_p V$. The optimal (in a least-squares sense) weight set $V$ can be determined
from $V = (X_p^T X_p)^{-1} X_p^T Y_p$. Iterative deterministic methods such as the LMS rule can also
be applied in determining the $n \cdot (m+h)$ output weights.
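
The least-squares step can be illustrated with a short, dependency-free Python sketch that forms and solves the normal equations. The helper names are hypothetical, and the code assumes that the matrix of inputs to the output layer has full column rank so that the normal equations are solvable.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def output_weights(X, Y):
    """Least-squares V for X V ~= Y, one output column at a time."""
    cols = range(len(X[0]))
    xtx = [[sum(row[i] * row[j] for row in X) for j in cols] for i in cols]
    v_columns = []
    for o in range(len(Y[0])):
        xty = [sum(row[i] * y[o] for row, y in zip(X, Y)) for i in cols]
        v_columns.append(solve_linear(xtx, xty))
    # Transpose so that V[j][o] is the weight from unit j to output o.
    return [list(r) for r in zip(*v_columns)]

# Three patterns, two inputs to the output layer, one output.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0], [-1.0], [1.0]]
V = output_weights(X, Y)
```

Forming the normal equations explicitly keeps the example close to the closed-form expression in the text; for ill-conditioned data, a QR- or SVD-based solver is usually the more numerically robust choice.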

Finally, the complexity of the network was monitored using an MDL-like [15]
objective function

$$J_{MDL} = \ln(\sigma^2) + n_w \ln(p) / p$$

where $n_w$ is the number of weights. The number of hidden units can be directly
incorporated in this calculation according to

$$J_{MDL} = \ln(\sigma^2) + \left( n \cdot (m+h) + m \cdot h + h \cdot (h-1)/2 \right) \ln(p) / p$$

Ultimately, one desires to construct a network with minimal complexity cost.
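
To make the weight count concrete, the following Python sketch evaluates the complexity cost for a given network size. The function names are hypothetical, and the first argument is assumed to be the error variance that appears inside the logarithm.

```python
import math

def weight_count(m, h, n):
    """Weights in a cascaded network with m inputs, h hidden units, n outputs.

    n * (m + h): every output unit sees all inputs and hidden units.
    m * h + h * (h - 1) / 2: the k-th hidden unit sees m inputs and the
    k - 1 hidden units created before it.
    """
    return n * (m + h) + m * h + h * (h - 1) // 2

def mdl_cost(error_variance, m, h, n, p):
    """J_MDL = ln(variance) + n_w * ln(p) / p for p training patterns."""
    return math.log(error_variance) + weight_count(m, h, n) * math.log(p) / p

# Two inputs, one output: the penalty grows as hidden units are added.
counts = [weight_count(2, h, 1) for h in range(4)]
cost = mdl_cost(1.0, 2, 0, 1, 100)
```

The quadratic term `h * (h - 1) / 2` comes from the cascade wiring, so for a fixed error variance the penalty grows faster with each added hidden unit than it would in a single-hidden-layer network.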

3.  Results 

The proposed algorithm was initially tested on the 3-bit and 4-bit parity problems.
Using the above formulation, solutions were found rapidly, within a small number of
evolutionary training cycles or generations (typically fewer than 10). Due to the discrete
nature of the problem, the fitness values climbed in a precipitous fashion using
evolutionary optimization with 50-100 parents. The parity problem proved insightful for
tuning the algorithm in that dependence on the scaling factor was observed. For example,
$S_f = 10$ reliably yielded architectures with only one hidden node for the 3-bit parity
problem or two hidden nodes for the 4-bit parity problem, whereas $S_f = 1$ often resulted
in networks with more than the minimum number of nodes.

The  next  experiment  investigated  the  two-spirals  problem.  The  two-spirals 
problem  requires  the  network  to  discriminate  between  two  interlocking  spirals  which 
encircle the origin three times. This problem has proved troublesome for standard
architectures and training methods [2]. The two spirals are readily discriminated using the
CCLA with quickprop training. Since these experiments replace the quickprop training
with stochastic search, similar results were expected. The training method outlined above
did manage to solve the two-spirals problem, but more than the typical 10-15 hidden units
were  necessary.  The  need  for  additional  hidden  units  may  have  resulted  from  a  need  for 
better  local  learning  or  increased  behavioral  freedom.  These  hypotheses  are  currently 
under  investigation. 

The receptive plane generated using 100 parents (without the Solis and Wets
technique), tanh activation functions, and $S_f = 1$ is shown in Figure 2(a). The receptive
plane does not appear to generalize as well as the cascade-correlation classifier found using
quickprop [2] in that a "sausage-link" effect occurs as opposed to a clean spiral. The number
of training cycles and hidden nodes were arbitrarily limited to 300 and 24, respectively, as
shown in Figure 2(b). Figure 3 compares the optimization process in terms of network
mean squared-error and complexity as each hidden node is incorporated. For this training
run, Figure 3(b) shows that a flattening occurs around 10 hidden nodes before the
complexity cost resumes its climb upon incorporation of additional hidden nodes.


Figure 2. (a) The training points are superimposed on the contour grid for the two-spiral problem, and
(b) the best fitness $\Phi(a)$ in each pool of candidate hidden units is plotted against the number of nodes added.

Figure 3. (a) The mean squared network error, and (b) the complexity cost $J_{MDL}$ as new hidden nodes
are incorporated into the network.

4. Conclusion


As  more  information  and  insight  are  gained  into  the  dynamics  of  evolutionary 
computation,  it  is  inevitable  that  components  from  the  various  search  strategies  (ES,  EP, 
GA)  will  be  combined  to  yield  fairly  robust  multi-agent  stochastic  search  techniques.  This 
work  has  demonstrated  the  applicability  of  evolutionary  search,  albeit  a  hybrid  approach, 
for  use  in  the  cascade-correlation  learning  architecture.  More  importantly,  this  work 
represents  a  preliminary  step  toward  using  evolutionary  search  in  a  purely  constructive 
manner wherein limited fan-in random wired nodes [16] can be generated with a randomly
chosen activation function $f \in F$. These issues will be addressed in subsequent work.